AITopics | human score

Collaborating Authors

human score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Wang, Xiao, Larionov, Daniil, Wu, Siwei, Liu, Yiqi, Eger, Steffen, Moosavi, Nafise Sadat, Lin, Chenghua

arXiv.org Artificial IntelligenceNov-25-2025

Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

computational linguistic, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2504.02106

Country:

Asia (1.00)
North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment

Zhang, Yulong

arXiv.org Artificial IntelligenceOct-28-2025

We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.21774

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.49)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

Quantitative LLM Judges

Sahoo, Aishwarya, Karnuthala, Jeevana Kruthi, Budhwani, Tushar Parmanand, Agarwal, Pranchal, Vaidyanathan, Sankaran, Siu, Alexa, Dernoncourt, Franck, Healey, Jennifer, Lipka, Nedim, Rossi, Ryan, Bhattacharya, Uttaran, Kveton, Branislav

arXiv.org Artificial IntelligenceOct-24-2025

LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2506.02945

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Add feedback

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

Luera, Reuben A., Rossi, Ryan, Dernoncourt, Franck, Basu, Samyadeep, Kim, Sungchul, Mukherjee, Subhojyoti, Mathur, Puneet, Zhang, Ruiyi, Kil, Jihyung, Lipka, Nedim, Yoon, Seunghyun, Gu, Jiuxiang, Wang, Zichao, Bearfield, Cindy Xiong, Kveton, Branislav

arXiv.org Artificial IntelligenceOct-13-2025

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.08783

Country: North America > United States > California (0.29)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Estimating Machine Translation Difficulty

Proietti, Lorenzo, Perrella, Stefano, Zouhar, Vilém, Navigli, Roberto, Kocmi, Tom

arXiv.org Artificial IntelligenceAug-29-2025

Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. In this context, automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. In this work, we address this gap by formalizing the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. Our results show that dedicated models outperform both heuristic-based methods and LLM-as-a-judge approaches, with Sentinel-src achieving the best performance. Thus, we release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

machine learning, natural language, translation, (18 more...)

arXiv.org Artificial Intelligence

2508.10175

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Lee, Hoyeon, Son, Sejung, Kang, Ye-Eun, Kim, Jong-Hwan

arXiv.org Artificial IntelligenceJul-25-2025

Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2507.18044

Country: Asia (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

Kim, Hyunjong, Kim, Sangyeop, Jeong, Jongheon, Cho, Yeongjae, Cho, Sungzoon

arXiv.org Artificial IntelligenceJul-1-2025

Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.

explanation, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2506.24016

Country:

Asia (0.93)
North America > United States (0.46)

Genre: Research Report > New Finding (0.93)

Industry:

Government (0.46)
Law > Statutes (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

VCRScore: Image captioning metric based on V\&L Transformers, CLIP, and precision-recall

Ruiz, Guillermo, Ramírez, Tania, Moctezuma, Daniela

arXiv.org Artificial IntelligenceJan-15-2025

Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption given a specific image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve the overall model's performance. Nevertheless, despite increasing proposals, the performance metrics used to measure their advances have remained practically untouched through the years. A probe of that, nowadays metrics like BLEU, METEOR, CIDEr, and ROUGE are still very used, aside from more sophisticated metrics such as BertScore and ClipScore. Hence, it is essential to adjust how are measure the advances, limitations, and scopes of the new image captioning proposals, as well as to adapt new metrics to these new advanced image captioning approaches. This work proposes a new evaluation metric for the image captioning problem. To do that, first, it was generated a human-labeled dataset to assess to which degree the captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric, and compare it with several well-known metrics, from classical to newer ones. Outperformed results were also found, and interesting insights were presented and discussed.

artificial intelligence, caption, evaluation, (17 more...)

arXiv.org Artificial Intelligence

2501.09155

Country:

Europe (1.00)
North America > United States > New York (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs

Wan, Yuxuan, Dong, Yi, Xiao, Jingyu, Huo, Yintong, Wang, Wenxuan, Lyu, Michael R.

arXiv.org Artificial IntelligenceDec-19-2024

Multi-page websites dominate modern web development. However, existing design-to-code methods rely on simplified assumptions, limiting to single-page, self-contained webpages without external resource connection. To address this gap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation task, which transforms UI designs into multi-page, functional web UIs with internal/external navigation, image loading, and backend routing. We propose a novel resource list data structure to track resources, links, and design components. Our study applies existing methods to the MRWeb problem using a newly curated dataset of 500 websites (300 synthetic, 200 real-world). Specifically, we identify the best metric to evaluate the similarity of the web UI, assess the impact of the resource list on MRWeb generation, analyze MLLM limitations, and evaluate the effectiveness of the MRWeb tool in real-world workflows. The results show that resource lists boost navigation functionality from 0% to 66%-80% while facilitating visual similarity. Our proposed metrics and evaluation framework provide new insights into MLLM performance on MRWeb tasks. We release the MRWeb tool, dataset, and evaluation framework to promote further research.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.1531

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(4 more...)

Add feedback

GRACE: Generating Socially Appropriate Robot Actions Leveraging LLMs and Human Explanations

Dogan, Fethiye Irmak, Ozyurt, Umut, Cinar, Gizem, Gunes, Hatice

arXiv.org Artificial IntelligenceSep-25-2024

When operating in human environments, robots need to handle complex tasks while both adhering to social norms and accommodating individual preferences. For instance, based on common sense knowledge, a household robot can predict that it should avoid vacuuming during a social gathering, but it may still be uncertain whether it should vacuum before or after having guests. In such cases, integrating common-sense knowledge with human preferences, often conveyed through human explanations, is fundamental yet a challenge for existing systems. In this paper, we introduce GRACE, a novel approach addressing this while generating socially appropriate robot actions. GRACE leverages common sense knowledge from Large Language Models (LLMs), and it integrates this knowledge with human explanations through a generative network architecture. The bidirectional structure of GRACE enables robots to refine and enhance LLM predictions by utilizing human explanations and makes robots capable of generating such explanations for human-specified actions. Our experimental evaluations show that integrating human explanations boosts GRACE's performance, where it outperforms several baselines and provides sensible explanations.

explanation, robot, robot action, (16 more...)

arXiv.org Artificial Intelligence

2409.16879

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback